Abstract

Express  cubes  are  k-ary  n-cube  interconnection  networks  augmented  by  express  channels 
that  provide  a  short  path  for  non-local  messages.  An  express  cube  combines  the 
logarithmic  diameter  of  an  indirect  network  with  the  wire  efficiency  and  ability  to  exploit 
locality  of  a  direct  network.  The  insertion  of  express  channels  reduces  the  network 
diameter  and  thus  the  distance  component  of  network  latency.  Wire  length  is  increased 
allowing networks to operate with latencies that approach the physical speed-of-light
limitation  rather  than  being  limited  by  node  delays.  Express  channels  increase  wire 
bisection  in  a  manner  that  allows  the  bisection  to  be  controlled  independent  of  the  choice 
of  radix,  dimension,  and  channel  width.  By  increasing  wire  bisection  to  saturate  the 
available  wiring  media,  throughput  can  be  substantially  increased.  With  an  express  cube 
both  latency  and  throughput  are  wire-limited  and  within  a  small  factor  of  the  physical  limit 
on  performance.  Express  channels  may  be  inserted  into  existing  interconnection  networks 
using  interchanges.  No  changes  to  the  local  communication  controllers  are  required. 
1  Introduction

Interconnection networks are used to pass messages containing data and synchronization information
between the nodes of concurrent computers. The messages may be sent between the processing nodes
of a message-passing multicomputer or between the processors and memories of a shared-memory
multiprocessor.


An  interconnection  network  is  characterized  by  its  topology,  routing,  and  flow  control  [10].  The 
topology  of  a  network  is  the  arrangement  of  its  nodes  and  channels  into  a  graph.  Routing  de¬ 
termines  the  path  chosen  by  a  message  in  this  graph.  Flow  control  deals  with  the  allocation  of 
channel  and  buffer  resources  to  a  message  as  it  travels  along  this  path.  This  paper  deals  only 
with  topology.  Express  cubes  can  be  applied  independent  of  routing  and  flow  control  strategies. 


The performance of a network is measured in terms of its latency and its throughput. The latency
of a message is the elapsed time from when the message send is initiated until the message is
completely received. Network latency is the average message latency under specified conditions.
Network throughput is the number of messages the network can deliver per unit time.

¹The research described in this paper was supported in part by the Defense Advanced Research Projects Agency
under contracts N00014-88K-0738 and N00014-87K-0825 and in part by a National Science Foundation Presidential
Young Investigator Award with matching funds from General Electric Corporation and IBM Corporation.

Low-dimensional k-ary n-cube networks using wormhole routing have been shown to provide low
latency and high throughput for networks that are wire-limited [4] [5] [9]. For n ≤ 3, the k-ary
n-cube topology is wire-efficient in that it makes efficient use of the available bisection width. This
topology maps into the three physical dimensions in a manner that allows messages to use all of the
available bandwidth along their path without ever having to double back on themselves. Also,
low-dimensional k-ary n-cubes concentrate bandwidth into a few wide channels so that the component
of latency due to message length is reduced. In most contemporary concurrent computers, this is
the dominant component of latency. Because of their low latency, high throughput, and affinity for
implementation in VLSI, these k-ary n-cube networks with n = 2 or 3 have been used successfully
in the design of several concurrent computers including the Ametek 2010 [17], the J-Machine [7]
[8], and the Mosaic [18].

However, low-dimensional k-ary n-cube interconnection networks have two significant shortcomings:

•  Because wires are short, node delays dominate wire delays and the distance-related component
of latency falls more than an order of magnitude short of speed-of-light limitations. In
the J-Machine [7], for example, node delay is 50ns while the longest wire is 225mm and has
a time-of-flight delay of 1.5ns.

•  The channel width of these networks is often limited by node pin count rather than by
wire bisection. For example, the J-Machine channel width is limited to 9 bits by pin count
limitations. In the physical node width of 50mm, a 6-layer printed circuit board can handle
over four times this channel width after accounting for through holes and local connections.

In short, many regular k-ary n-cube interconnection networks are node-limited rather than
wire-limited. In these networks, node delay and pin limitations dominate wire delay and wire density
limitations. The ratios of node delay to wire delays and pin density to wire density cannot be
balanced in a regular k-ary n-cube.

Express cubes overcome this problem by allowing wire length and wire density to be adjusted
independently of the choice of radix, k, dimension, n, and channel width, W. An express cube
is a k-ary n-cube augmented by one or more levels of express channels that allow non-local
messages to bypass nodes. The wire length of the express channels can be increased to the
point that wire delays dominate node delays. The number of express channels can be adjusted to
increase throughput until the available wiring media is saturated. This ability to balance node and
wire limitations is achieved without sacrificing the wire-efficiency of k-ary n-cube networks. The
number of channels traversed by a message in a hierarchical express cube grows logarithmically
with  distance  as  in  a  multistage  interconnection  network  [llj[i.*j.  The  express  cube,  however,  is 
able  to  exploit  locality  while  in  a  multistage  network  all  messages  must  traverse  the  diameter  of 
the network. With an express cube, both latency and throughput are wire limited and are within
a  small  constant  factor  of  the  physical  limit  on  performance. 

The  remainder  of  this  paper  describes  the  express  cube  topology  and  analyzes  its  performance. 
Section 2 summarizes the notation that will be used throughout the paper. Section 3 introduces
the  express  cube  topology  in  steps.  Basic  express  cubes  (Section  3.1)  reduce  latency  to  twice  the 
delay  of  a  dedicated  wire  for  messages  traveling  long  distances.  Throughput  can  be  increased  to 
saturate  the  available  wiring  density  by  adding  multiple  express  channels  (Section  3.2).  With  a 
hierarchical  express  cube  (Section  3.3),  latency  for  short  distances,  while  node-limited,  is  within 
a  small  constant  factor  of  the  best  that  can  be  achieved  by  any  bounded  degree  network.  Some 
design  considerations  for  express  cube  interchanges  are  discussed  in  Section  4. 


2  Notation 

The following symbols are used in this paper. They are listed here for reference.

C, the set of channels in the network.

D, the Manhattan distance traveled by a message, |x_s − x_d| + |y_s − y_d| + |z_s − z_d|, where
the source is at (x_s, y_s, z_s) and the destination is at (x_d, y_d, z_d).

H, hops, the number of nodes traversed by a message.

i, the number of nodes between interchanges in an express cube.

k, the radix of the network: the length in each dimension.

l, the number of levels of hierarchy in a hierarchical express cube.

L, the message length in bits.

n, the dimension of the network.

N, the set of nodes in the network. Where it is unambiguous, N is also used for the
number of nodes in the network, |N|.

Tn, the latency of a node.

Tw, the latency of a wire that connects two physically adjacent nodes.

Tp, the pipeline period of a node.

W, the width of a channel in bits.

α, the ratio of node latency to wire latency, Tn/Tw.

Communication  between  nodes  is  performed  by  sending  messages.  A  message  may  be  broken 
into  one  or  more  packets  for  transmission.  A  packet  is  the  smallest  unit  that  contains  routing 
and  sequencing  information.  Packets  contain  one  or  more  flow  control  digits  or  flits.  A  flit  is 
the  smallest  unit  on  which  flow  control  is  performed.  A  flit  in  turn  is  composed  of  one  or  more 
physical transfer units or phits². A phit is W bits, the size of the physical communication media.

An interconnection network consists of a set of nodes, N, that are connected by a set of channels,
C ⊆ N × N. Each channel is unidirectional and carries data from a source node to a destination
node. For the purposes of this paper it is assumed that the network is bidirectional: channels
occur in pairs so that (n1, n2) ∈ C ⇒ (n2, n1) ∈ C.
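
The channel-set definition above can be made concrete with a small sketch. It assumes nodes are numbered 0 to k − 1 along a linear array without wraparound; the function names are hypothetical and chosen only for illustration.

```python
def linear_array_channels(k):
    """Return the channel set C of a bidirectional k-ary 1-cube (linear array).

    Nodes are numbered 0..k-1 (a labeling assumed here for illustration).
    Each channel is a unidirectional (source, destination) pair.
    """
    channels = set()
    for node in range(k - 1):
        channels.add((node, node + 1))  # channel toward higher addresses
        channels.add((node + 1, node))  # paired channel in the reverse direction
    return channels


def is_bidirectional(channels):
    """True when (n1, n2) in C implies (n2, n1) in C."""
    return all((dst, src) in channels for (src, dst) in channels)


if __name__ == "__main__":
    C = linear_array_channels(4)
    print(sorted(C), is_bidirectional(C))
```

A 4-node array yields six unidirectional channels, three in each direction, and the pairing check holds by construction.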

²There is no constraint that the physical unit of transfer, phit, must be smaller than the flow control unit, flit.
It is possible to construct systems with several flits in each phit.
Figure 1: Insertion of express channels reduces latency: (A) A regular k-ary 1-cube network may
be dominated by node delay, (B) A k-ary 1-cube with express channels reduces the node delay
component of latency.


3  Express  Cubes 

3.1  Express  Channels  Reduce  Latency 

Figure 1 illustrates the application of express channels to a k-ary 1-cube or linear array. A regular
k-ary 1-cube is shown in Figure 1A. The network is a linear array of k processing nodes, labeled N,
each connected to its nearest neighbors by channels of width W. The delay of a phit propagating
through a node is Tn. The delay of the wire connecting two nodes is Tw. Each channel can accept
a new phit every Tp. The latency of a message of length L sent distance D is

$$T_a = H T_n + D T_w + \frac{L}{W} T_p = (T_n + T_w) D + \frac{L}{W} T_p. \qquad (1)$$

Message latency is composed of three components as shown in equation (1). The first component
is the node latency, due to the number of hops, H. The second component is the wire latency, due
to the distance D. The third component is due to message length, L. For a conventional k-ary
n-cube, H = D and since for most networks Tn >> Tw, the node latency dominates the wire
latency. Express cubes reduce the node latency by increasing wire length to reduce the number
of hops, H.

An express k-ary 1-cube is shown in Figure 1B. Express channels have been added to the array
by inserting an interchange, labeled I, every i nodes. An interchange is not a processing node.
It performs only communication functions and is not assigned an address. Each interchange
is connected to its neighboring interchanges by an additional channel of width W, the express
channel. When a message arrives at an interchange it is routed directly to the next interchange if
it is not destined for one of the intervening nodes. To preserve the wire-efficiency of the network,
messages are never routed past their destinations on the express channels even though doing so
would reduce H in many cases.

The delay, Tn, and throughput, 1/Tp, of an interchange are assumed to be identical to those of
a node. The wire delay of the express channel is assumed to be iTw. To simplify the following
analysis, it is assumed that interchanges add no physical distance to the network. Assuming i | D
(i divides D), H = D/i + i and insertion of express channels reduces the latency to

$$T_b = \left(\frac{D}{i} + i\right) T_n + D T_w + \frac{L}{W} T_p. \qquad (2)$$
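
To make the comparison between equations (1) and (2) concrete, the following sketch evaluates both. The function names and the numeric values in the usage example are assumptions chosen for illustration; they are not measurements from any machine described in this paper.

```python
def latency_regular(D, L, W, Tn, Tw, Tp):
    """Equation (1): every unit of distance costs one node and one wire delay."""
    return (Tn + Tw) * D + (L / W) * Tp


def latency_express(D, i, L, W, Tn, Tw, Tp):
    """Equation (2): valid when i divides D, so that H = D/i + i."""
    if D % i != 0:
        raise ValueError("equation (2) assumes that i divides D")
    hops = D // i + i
    return hops * Tn + D * Tw + (L / W) * Tp


if __name__ == "__main__":
    # Illustrative values only: Tn = 32 and Tw = 1 in arbitrary time units.
    print(latency_regular(64, 0, 16, 32, 1, 32))
    print(latency_express(64, 8, 0, 16, 32, 1, 32))
```

With these assumed values the hop count falls from 64 to 16, so the node-delay term shrinks by a factor of four while the wire term is unchanged.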

In the general case, an average message traversing D processing nodes travels over H_i = (i + 1)/2
local channels to reach an interchange, H_e = ⌊D/i − 1/2 + 1/(2i)⌋ express channels to reach the
last interchange before the destination, and finally H_f = (D − i/2 + 1/2) mod i local channels to
the destination. The total number of hops is H = H_i + H_e + H_f, giving a latency of

$$T_b = \left(\frac{i+1}{2} + \left\lfloor \frac{D}{i} - \frac{1}{2} + \frac{1}{2i} \right\rfloor + \left(\left(D - \frac{i}{2} + \frac{1}{2}\right) \bmod i\right)\right) T_n + D T_w + \frac{L}{W} T_p. \qquad (3)$$


For large distances, D >> α = Tn/Tw, choosing i = α balances the node and wire delay. With
this choice of i, the latency due to distance is approximately twice the wire latency, TD ≈ 2TwD.
The latency for large distances of a large express channel network with i = α is within a factor of
two of the latency of a dedicated Manhattan wire between the source and destination³.

For small distances or large α, the i term in the coefficient of Tn in equation (2) is significant and
node delay dominates. For such networks, latency is minimized by choosing i = √D resulting in
TD ≈ 2(√D − 1)Tn. The use of hierarchical express channels (Section 3.3) can further improve
the latency for small distances.
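
The two choices of interchange spacing discussed above can be explored numerically. This is a minimal sketch that evaluates only the distance-related terms of equation (2) and, as a simplifying assumption, treats D/i as a real number rather than requiring i to divide D. The helper names and parameter values are hypothetical.

```python
def distance_latency(D, i, Tn, Tw):
    """Distance-related part of equation (2): (D/i + i) * Tn + D * Tw."""
    return (D / i + i) * Tn + D * Tw


def best_spacing(D, Tn, Tw, candidates):
    """Return the candidate spacing i that minimizes the distance latency."""
    return min(candidates, key=lambda i: distance_latency(D, i, Tn, Tw))


if __name__ == "__main__":
    Tn, Tw = 16, 1  # assumed values, so the ratio Tn/Tw is 16
    # Long distance with i equal to Tn/Tw: latency relative to a bare wire.
    print(distance_latency(4096, 16, Tn, Tw) / (4096 * Tw))
    # Short distance: search all spacings for the minimum.
    print(best_spacing(64, Tn, Tw, range(1, 65)))
```

For the long distance the ratio to the bare wire delay is close to two, and for D = 64 the search settles on i = 8, the square root of D.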


3.2  Multiple  Express  Channels  Increase  Throughput  to  Saturate  Wire  Density 

To  first  order,  network  throughput  is  proportional  to  wire  bisection  and  hence  wire  density.  If  more 
wires  are  available  to  transmit  data  across  the  network,  throughput  will  be  increased  provided 
that  routing  and  flow  control  strategies  are  able  to  profitably  schedule  traffic  onto  these  wires. 
Many regular network topologies, such as low-dimensional k-ary n-cubes, are unable to make use
of  all  available  wire  density  because  of  pin  limitations.  The  wire  bisection  of  an  express  cube  can 
be  controlled  independent  of  the  choice  of  radix,  k,  dimension,  n,  or  channel  width,  W  by  adding 
multiple  express  channels  to  the  network  to  match  network  throughput  with  the  available  wiring 
density. 

Figure  2  shows  two  methods  of  inserting  multiple  express  channels.  Multiple  express  channels 
may be handled by each interchange as shown in Figure 2A. Alternatively, simplex interchanges
can  be  interleaved  as  shown  in  Figure  2B. 

In method A, using multiple channel interchanges, an interchange is inserted every i nodes as above
and each interchange is connected to its neighbors using m parallel express channels. Figure 2A
shows a network with i = 4 and m = 2. The interchange acts as a concentrator combining
messages arriving on the m incoming express channels with non-local messages arriving on the
local channel and concentrating these message streams onto the m outgoing express channels.
This method has the advantage of making better use of the express channels since any message
can route on any express channel. Flexibility in express channel assignment is achieved at the
expense of higher pin count and limited expansion.

³There is nothing special about the factor of two. By choosing i = jα the distance component of latency will
be (1 + 1/j) times the latency of a Manhattan wire.

Figure 2: Multiple express channels allow wire density to be increased to saturate the available
wiring media. Express channels can be added using either (A) interchanges with multiple express
channels, or (B) interleaved simplex interchanges.

With  method  B,  interleaving  simplex  interchanges,  m  simplex  interchanges  are  inserted  into  each 
group of i nodes. Each interchange is connected to the corresponding interchange in the next group
by  a  single  express  channel.  All  messages  from  the  nodes  immediately  before  an  interchange  will  be 
routed  on  that  interchange’s  express  channels.  Because  load  cannot  be  shared  among  interleaved 
express  channels,  an  uneven  distribution  of  traffic  may  result  in  some  channels  being  saturated 
while  parallel  channels  are  idle.  Method  B  has  the  advantage  of  using  simple  interchanges  and 
allowing  arbitrary  expansion.  In  the  extreme  case  of  inserting  an  interchange  between  every  pair 
of  nodes  the  resulting  topology  is  almost  the  same  as  the  topology  that  would  result  from  doubling 
the  number  of  dimensions. 

Both  of  the  methods  illustrated  in  Figure  2  have  the  effect  of  increasing  the  wire  density  (and 
bisection)  by  a  factor  of  m  +  1.  To  first  order,  network  throughput  will  increase  by  a  similar 
amount.  There  will  be  some  degradation  due  to  uneven  loading  of  parallel  channels. 

The use of multiple express channels offsets the load imbalance between express and local channels.
If traffic is uniformly distributed, the average fraction of messages crossing a point in the network
on a local channel is P_l = 2i/k as compared to P_e = (k − 2i)/k crossing on an express channel.
For large networks where k >> i, the bulk of the traffic is on express channels. Increasing the
number of express channels applies more of the network bandwidth where it is most needed.
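
The traffic split can be checked with a short sketch. The first two functions restate the fractions given above. The third, which divides the express fraction evenly over m parallel express channels, is an added assumption for illustration and ignores the uneven loading mentioned earlier.

```python
def local_fraction(i, k):
    """Fraction of messages crossing a point on a local channel, 2i/k."""
    if k <= 2 * i:
        raise ValueError("the expressions assume k > 2i")
    return 2 * i / k


def express_fraction(i, k):
    """Fraction of messages crossing a point on an express channel, (k - 2i)/k."""
    if k <= 2 * i:
        raise ValueError("the expressions assume k > 2i")
    return (k - 2 * i) / k


def express_load_per_channel(i, k, m):
    """Hypothetical helper: express traffic shared evenly by m express channels."""
    return express_fraction(i, k) / m


if __name__ == "__main__":
    i, k = 4, 64  # illustrative values
    print(local_fraction(i, k), express_fraction(i, k))
    print(express_load_per_channel(i, k, 7))
```

With i = 4 and k = 64, seven eighths of the crossing traffic uses express channels, and under the even-sharing assumption it would take seven express channels to bring the load per express channel down to that of the local channel.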

Multiple express channels are an effective method of increasing throughput in networks where the
channel width is limited by pinout constraints. For example, in the J-Machine the channel width,
W = 9, is set by pin limitations⁴. The printed-circuit board technology is capable of running 80
wires in each dimension across the 50mm width of a node. Even with many of these wires used
for local connections, four parallel 15-bit (data+control) wide channels can be easily run across
each node. A multiple express channel network with m = 3 could use this available wire density
to quadruple the throughput of the network.

⁴Each J-Machine node is packaged in a 168-pin PGA. The six communication channels each require 9 data bits
and 6 control bits, consuming 90 of these pins. Power connections use 48 pins. The remaining 30 pins are used by
external memory interface and control.

Figure 3: Hierarchical express channels reduce latency due to local routing.


3.3  Hierarchical Express Cubes Have Logarithmic Node Delay

With a single level of express channels, an average of i local channels are traversed by each
non-local message. The node delay on these local channels represents a significant component
of latency and causes networks with short distances, D < α², where α = Tn/Tw, to be node limited.
Hierarchical express cubes overcome this limitation by using several levels of express channels to
make node delay increase logarithmically with distance for short distances.

The use of hierarchical express channels, shown in Figure 3, reduces the latency due to node
delay on local channels. With hierarchical express channels, there are l levels of interchanges. A
first-level interchange is inserted every i nodes. A second-level interchange replaces every i-th
first-level interchange, every i² nodes. In general, a j-th level interchange replaces every i-th
(j − 1)-st level interchange, every i^j nodes⁵. Figure 3 illustrates a hierarchical express cube with
i = 2, l = 2.

⁵This construction yields a fixed-radix express cube, with radix i for each level. It is also possible to construct
mixed-radix express cubes where the radix varies from level to level.

A j-th level interchange has j + 1 inputs and j + 1 outputs. Arriving messages are treated identically
regardless of the input on which they arrive. Messages that are destined for one of the next i
nodes are routed to the local (0th) output. Those remaining messages that are destined for one
of the next i² nodes are routed to the 1st output. The process continues with all messages with
a destination between i^p and i^(p+1) nodes away, 0 ≤ p ≤ j − 1, routed to the p-th output. All
remaining messages are routed to the j-th output.

Figure 4: Hierarchical interchanges (A) a third-level interchange. (B) a third-level interchange
implemented from first-level interchanges. (C,D) With a small performance penalty, ascending
and/or descending interchanges can be eliminated.

A message in a hierarchical express cube is delivered in three phases: ascent, cruise, and descent.
In the ascent phase, an average message travels (i + 1)/2 hops to get to the first interchange,
and (i − 1)/2 hops at each level for a total of H_a = (i − 1)l/2 + 1 hops and a distance of
D_a = (i^l − 1)/2. During the cruise phase, a message travels H_c = ⌊(D − D_a)/i^l⌋ hops on level
l channels for a distance of D_c = i^l H_c. Finally, the message descends back through the levels
routing on each level, j, as long as the remaining distance is greater than i^j. For the special case
where i^l | D, the descending message takes H_d = (i − 1)l/2 + 1 hops for a distance of D_d = (i^l + 1)/2.
This gives a latency of

$$T_c = \left(\frac{D}{i^l} + (i - 1)\,l + 1\right) T_n + D T_w + \frac{L}{W} T_p. \qquad (4)$$

Choosing i and l so that i^l = α balances node and wire delay for large distances. With this choice,
the delay due to local nodes is (i − 1) l Tn = (i − 1) log_i α Tn which is a minimum for i = e. While 3
is the closest integer to e, a choice of i = 4 is preferred to facilitate decoding of binary addresses
in interchanges, and networks with i = 8 or i = 16 may be desirable under some circumstances.
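
Equation (4) can be evaluated directly for the special case in which i^l divides D. The sketch below is illustrative only: the function name is hypothetical, and the numeric values are assumptions that borrow the parameters i = 4, l = 3 and a node-to-wire delay ratio of 64 quoted in the caption of Figure 5.

```python
def hierarchical_latency(D, i, l, L, W, Tn, Tw, Tp):
    """Equation (4): latency of a hierarchical express cube when i**l divides D."""
    span = i ** l  # distance covered by one top-level express channel
    if D % span != 0:
        raise ValueError("equation (4) assumes that i**l divides D")
    hops = D // span + (i - 1) * l + 1
    return hops * Tn + D * Tw + (L / W) * Tp


if __name__ == "__main__":
    # Tn = 64 and Tw = 1 give a node-to-wire delay ratio of 64, so i**l matches it.
    print(hierarchical_latency(640, 4, 3, 0, 16, 64, 1, 64))
```

For D = 640 the message makes 10 cruise hops plus (i − 1)l + 1 = 10 hops in the ascent and descent phases, 20 hops in total.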

In the general case, i^l ∤ D, the latency of a hierarchical express cube is calculated by representing
the source and destination coordinates as h = log_i k-digit radix-i numbers, S = s_{h−1} ··· s_0, and
D = d_{h−1} ··· d_0. WLOG we assume that S < D. During the ascent phase, a message routes
from S to s_{h−1} ··· (s_l + 1) 0 ··· 0 taking H_a = Σ_{j=0}^{l−1} ((i − s_j) mod i) hops for a distance of
D_a = Σ_{j=0}^{l−1} ((i − s_j) mod i) i^j. The cruise phase takes the message H_c hops for a
distance of D_c = H_c i^l. Finally, the descent phase takes the message from d_{h−1} ··· d_l 0 ··· 0 to D
taking H_d = Σ_{j=0}^{l−1} d_j hops for a distance of D_d = Σ_{j=0}^{l−1} d_j i^j. For short distances the cruise phase
will never be reached. The message will move from ascent to descent as soon as it reaches a node
where all non-zero coordinates agree with D. The total latency for the general case is plotted as
a function of distance in Figure 5.
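
The descent-phase bookkeeping can be sketched with radix-i digits. This is a minimal illustration covering only the descent phase, where the hop count is the sum of the low l digits of the destination coordinate and the distance is their weighted sum. The function names are hypothetical.

```python
def radix_digits(value, i, h):
    """Return the h radix-i digits of value, least significant digit first."""
    digits = []
    for _ in range(h):
        digits.append(value % i)
        value //= i
    return digits


def descent_hops_and_distance(dest, i, l, h):
    """Hops and distance of the descent phase for destination coordinate dest.

    The message starts at the coordinate whose low l digits are zero and
    takes d_j hops on level-j channels, each covering i**j nodes.
    """
    d = radix_digits(dest, i, h)
    hops = sum(d[j] for j in range(l))
    distance = sum(d[j] * i ** j for j in range(l))
    return hops, distance


if __name__ == "__main__":
    print(radix_digits(27, 4, 3))
    print(descent_hops_and_distance(27, 4, 2, 3))
```

For destination 27 with i = 4 and l = 2, the descent distance equals 27 mod 16, the offset of the destination past the last top-level interchange.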

Figure 4 shows how hierarchical interchanges can be implemented using pin-bounded modules. A
level-j interchange requires j + 1 inputs and outputs if implemented as a single module as shown
for a third-level interchange in Figure 4A. A level-j interchange can be decomposed into 2j − 1
level-one interchanges as shown for j = 2 in Figure 4B. A series of j − 1 ascending interchanges
that route non-local traffic toward higher levels is followed by a top-level interchange and a series
of j − 1 descending interchanges that allow local traffic to descend. With some degradation in
performance, the ascending interchanges can be eliminated as shown in Figure 4C. This change
requires extra hops in some cases as a message cannot skip levels on its way up to a high-level
express channel. Each message must traverse at least one level j − 1 channel before being switched
to a level-j channel. By restricting messages to also travel on at least one channel at each level
as they descend, the descending interchanges can be eliminated as well leaving only the single
top-level interchange as shown in Figure 4D.

Figure 5 (delay vs. distance for express cubes): Latency as a function of distance for a hierarchical
express channel cube with i = 4, l = 3, α = 64, and a flat express channel cube with i = 16, α = 64.
In a hierarchical express channel cube latency is logarithmic for short distances and linear for long
distances. The crossover occurs between D = α and D = iα log_i α. The flat cube has linear delay
dominated by Tn for short distances and by Tw for long distances.
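
The output-selection rule that Section 3.3 gives for a level-j interchange can be sketched as a small function. The inclusive boundaries used here (a destination exactly i^(p+1) nodes away still takes output p) are an assumption drawn from the phrase "one of the next i nodes"; the function name is hypothetical.

```python
def select_output(distance, i, j):
    """Choose an output of a level-j interchange for a message.

    distance is the number of nodes to the destination (at least 1).
    Output 0 is the local channel and output j is the highest express level.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    for p in range(j):
        if distance <= i ** (p + 1):
            return p  # destination lies within the next i**(p+1) nodes
    return j  # all remaining messages use the top-level express channel


if __name__ == "__main__":
    for distance in (3, 4, 5, 16, 17):
        print(distance, select_output(distance, 4, 2))
```

The loop tests the nearest range first, so each message leaves on the lowest-level output that still covers its destination, which keeps it from being carried past the destination on an express channel.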

3.4  Performance  Comparison 

Figure 5 shows how latency varies with distance in hierarchical and flat express cubes and compares
these latencies with the latency of a conventional k-ary 1-cube and of a direct wire. These
curves assume that the message source is midway between two interchanges. The latencies are
normalized to units of the wire delay between adjacent nodes. The latency of a conventional k-ary
1-cube is linear with slope α while the latency of a wire is linear with slope 1.
Figure 6: A multidimensional express cube may be constructed either by (A) inserting interchanges
into each dimension separately, or (B) interleaving multi-dimensional interchanges into
the array.


For short distances, until the first express channel is reached, a flat (non-hierarchical) express cube
has the same delay as a conventional k-ary n-cube, TD = αD. Once the message begins traveling
on express channels, latency increases linearly with slope 1 + α/i. This occurs at distance D = 24
in the figure. There is a periodic variation in delay around this asymptote due to the number of
local channels being traversed, H_local = (i + 1)/2 + ((D − i/2 + 1/2) mod i).

The hierarchical express cube has a latency that is logarithmic for short distances and linear for
long distances. The latency of messages traveling a short distance, D < α, is node limited and
increases logarithmically with distance, TD ≈ (i − 1) log_i D Tn. This delay is within a factor of i − 1
of the best that can be achieved with radix-i switches. Long distance messages have a latency
of TD ≈ (1 + α/i^l) Tw D. If i^l = α, this long distance latency is approximately twice the latency
of a dedicated Manhattan wire. In a hierarchical network, the interchange spacing, i, can be
made small, giving good performance for short distances, without compromising the delay of long
distance messages which depends on the ratio α/i^l. In a flat network with a single parameter, i,
it is not possible to simultaneously optimize performance for both short and long distances.
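
The approximate latencies quoted in this section can be tabulated in units of the adjacent-node wire delay. The sketch below encodes only those approximations, not the exact curves of Figure 5. The function names are hypothetical, and alpha stands for the ratio of node delay to wire delay.

```python
import math


def conventional(D, alpha):
    """Conventional k-ary 1-cube: linear with slope alpha."""
    return alpha * D


def direct_wire(D):
    """Dedicated wire: linear with slope 1."""
    return D


def hierarchical_short(D, i, alpha):
    """Short-distance approximation: (i - 1) * log_i(D) node delays."""
    return (i - 1) * math.log(D, i) * alpha


def hierarchical_long(D, i, l, alpha):
    """Long-distance approximation: slope 1 + alpha / i**l."""
    return (1 + alpha / i ** l) * D


if __name__ == "__main__":
    i, l, alpha = 4, 3, 64  # parameters quoted for Figure 5
    print(hierarchical_short(16, i, alpha), conventional(16, alpha))
    print(hierarchical_long(1024, i, l, alpha), direct_wire(1024))
```

With i^l equal to alpha, the long-distance value is twice that of the direct wire, which matches the factor of two stated in the text.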


3.5  Express  Channels  in  Many  Dimensions 

A multidimensional express cube may be constructed by inserting interchanges into each dimension
separately as shown in Figure 6A. The figure shows part of a two-dimensional express cube
with i = 4, l = 1. Interchanges have been inserted separately into the X and Y dimensions. A
similar construction can be realized for higher dimensions and for hierarchical networks. With this
approach interchange pin-count is minimal as each interchange handles only a single dimension.
Also, the design is easy to package into modules as the interchanges are located in regular rows
and columns. This approach has the disadvantage that messages must descend to local channels
to switch dimensions.

Figure 7: Interchanges allow wire density, speed, and signalling levels to be changed at module
boundaries.

An  alternate  construction  of  a  multidimensional  express  cube  is  to  interleave  multidimensional 
interchanges into the array as shown in Figure 6B for i = 4, l = 1. This approach allows messages
on  express  channels  to  change  dimensions  without  descending  to  a  local  channel.  It  is  particularly 
useful  in  networks  that  use  adaptive  routing  [13][14]  as  it  provides  alternate  paths  at  each  level  of 
the  network.  The  interleaved  construction  has  the  disadvantages  of  requiring  a  higher  interchange 
pincount  and  being  more  difficult  to  package  into  modules. 


3.6  Modularity 

The  interchanges  in  an  express  cube  can  be  used  to  change  wire  density,  speed,  and  signalling 
levels  at  module  boundaries  as  shown  in  Figure  7.  Large  networks  are  built  from  many  modules 
in  a  physical  hierarchy.  A  typical  hierarchy  includes  integrated  circuits,  printed  circuit  boards, 
chassis,  and  cabinets.  Available  wire  density  and  bandwidth  change  significantly  between  levels 
of  the  hierarchy.  For  example,  a  typical  integrated  circuit  has  a  wire  density  of  250  wires/mm  per 
layer while a printed circuit board can handle only 2 wires/mm per layer⁶. Interchanges placed at
module  boundaries  as  shown  in  Figure  7  can  be  used  to  vary  the  number  and  width  of  express  and 
local  channels.  These  boundary  interchanges  may  also  convert  internal  module  signalling  levels 
and speeds to levels and speeds more appropriate between modules. Using express channels and
boundary  interchanges,  the  network  can  be  adjusted  to  saturate  the  available  wiring  density  even 
though  this  density  is  not  uniform  across  the  packaging  hierarchy.  To  make  use  of  the  available 
bandwidth,  computations  running  on  the  network  must  exploit  locality. 
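The density mismatch can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name `channels_that_fit`, the 10 mm edge length, the single wiring layer, and the 16-wire channel width are all assumptions, not values from the text. Only the two densities (250 and 2 wires/mm per layer) come from the example above, and, as in the footnote, all area is assumed available for wiring.

```python
def channels_that_fit(wires_per_mm: float, edge_mm: float,
                      layers: int, channel_width: int) -> int:
    """Number of channels of `channel_width` wires that can cross a
    module edge, assuming every wiring track is usable (hypothetical model)."""
    total_wires = int(wires_per_mm * edge_mm * layers)
    return total_wires // channel_width


# Densities from the text; edge length, layer count, and width are assumed.
chip_channels = channels_that_fit(250, 10, 1, 16)   # integrated circuit
board_channels = channels_that_fit(2, 10, 1, 16)    # printed circuit board
print(chip_channels, board_channels)
```

Under these assumed numbers the on-chip edge carries far more channels than the board edge, which is the situation in which a boundary interchange would reduce the number or width of channels leaving the module.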

⁶ This integrated circuit wire density is typical of first-level metal in a 1µ CMOS process. The printed circuit
wire density is for a board with 8 mil wires and spaces. Both densities assume all area is available for wiring.
Figure 8: Block diagram of an interchange. Two multiplexors perform switching between input
and  output  registers  based  on  a  comparison  of  the  high  address  bits  in  a  message  header. 


4  Interchange  Design 

Figure  8  shows  the  block  diagram  of  a  unidirectional  interchange.  A  bidirectional  interchange 
includes  an  identical  circuit  in  the  opposite  direction.  The  basic  design  is  similar  to  that  of  a 
router  [15][6][3].  Two  input  latches  hold  arriving  flits  and  two  output  latches  hold  departing  flits. 
If  additional  buffering  is  desired,  any  of  these  latches  may  be  replaced  by  a  FIFO  buffer.  If  a  phit 
is  a  different  size  than  a  flit,  multiplexing  and  demultiplexing  is  required  between  the  flit  buffers 
and  the  interchange  pins.  Associated  with  each  output  latch  is  a  multiplexor  that  selects  which 
input  is  routed  to  the  latch.  Routing  decisions  are  made  by  comparing  the  address  information 
in the head flit(s) of the message to the local address. If the destination lies within the next i
nodes, the local channel is chosen, otherwise the express channel is chosen. If i is a power of two,
interchanges are aligned, and absolute addresses are used in headers, the comparison can be made
by checking all but the log2 i least significant bits for equality to the local address.
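The power-of-two comparison can be sketched in a few lines. This is a minimal illustration, not the circuit of Figure 8: the function name `choose_channel` and the convention that `local` is the absolute address of a node in the aligned block of i nodes served by the interchange's local output are assumptions made for the example.

```python
def choose_channel(dest: int, local: int, i: int) -> str:
    """Pick the output channel for a message arriving at an interchange.

    Assumes i is a power of two, interchanges are aligned on multiples
    of i, and headers carry absolute addresses.
    """
    if i <= 0 or i & (i - 1):
        raise ValueError("i must be a power of two")
    shift = i.bit_length() - 1          # log2(i)
    # Compare all but the log2(i) least significant bits.
    if (dest >> shift) == (local >> shift):
        return "local"
    return "express"


print(choose_channel(dest=5, local=4, i=4))   # same block of four nodes
print(choose_channel(dest=9, local=4, i=4))   # a later block
```

Because only the high-order bits are compared for equality, no subtraction or magnitude comparison is needed, which keeps the routing decision cheap.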

The  interchange  state  includes  presence  bits  for  each  register,  an  input  state  for  each  input,  and 
an  output  state  for  each  output.  The  presence  bits  are  used  for  flit-level  flow  control.  A  flit  is 
allowed  to  advance  only  if  the  presence  bit  of  its  destination  register  is  clear  (no  data  present),  or 
if  the  register  is  to  be  emptied  in  the  same  cycle.  The  input  state  bits  hold  the  destination  port 
and  status  (empty,  head,  advancing,  blocked)  of  the  message  currently  using  each  input.  The 
output  state  consists  of  a  bit  to  identify  whether  the  output  is  busy  and  a  second  bit  to  identify 
which  input  has  been  granted  the  output.  The  combinational  logic  to  maintain  these  state  bits 
and  control  the  data  path  is  straightforward. 
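As an illustration of the presence-bit rule, the sketch below models a chain of one-flit registers advancing by one cycle. It is a simplified software model under stated assumptions, not the interchange logic itself: the `step` function, the list representation (`None` meaning a clear presence bit), and the `sink_ready` flag are all hypothetical, and arbitration between the two inputs is ignored.

```python
from typing import List, Optional


def step(regs: List[Optional[str]], sink_ready: bool) -> List[Optional[str]]:
    """Advance a chain of one-flit registers by one cycle.

    regs[0] is furthest upstream and regs[-1] feeds the sink. A flit
    advances only if its destination is empty or is emptied this cycle.
    """
    nxt = list(regs)
    can_accept = sink_ready             # state of the current destination
    for j in range(len(regs) - 1, -1, -1):
        moved = regs[j] is not None and can_accept
        if moved:
            if j + 1 < len(regs):
                nxt[j + 1] = regs[j]    # flit moves one register downstream
            nxt[j] = None               # presence bit cleared
        # Register j can accept a flit if it was empty or was just emptied.
        can_accept = regs[j] is None or moved
    return nxt


print(step(["c", "b", "a"], sink_ready=True))
print(step(["c", "b", "a"], sink_ready=False))
```

Scanning from the downstream end lets a full register accept a new flit in the same cycle in which its own flit departs, which is the second half of the rule stated above.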

5  Conclusion


Express cubes are k-ary n-cubes augmented by express channels that provide a short path for non-local
messages. An express cube retains the wire efficiency of a conventional k-ary n-cube while
providing improved latency and throughput that are limited only by the wire delay and available
wire density. For short distances, a hierarchical express cube has a latency that is within a small
factor of the best that can be achieved with a bounded degree network. For long distances, the
latency can be made arbitrarily close to that of a dedicated Manhattan wire. Multiple express
channels  can  be  used  to  increase  throughput  to  the  limit  of  the  available  wire  density.  The  express 
cube  combines  the  low  diameter  of  multistage  interconnection  networks  with  the  wire  efficiency 
and  ability  to  exploit  locality  of  a  direct  network.  The  result  is  a  network  with  latency  and 
throughput  that  are  within  a  small  factor  of  the  physical  limit. 

Express channels are added to a k-ary n-cube by periodically inserting interchanges into each
dimension. No modifications are required to the routers in each processing node; express channels
can be added to most existing k-ary n-cube networks. Interchanges also allow wire density, speed,
and  signalling  levels  to  be  changed  at  module  boundaries.  An  express  cube  can  make  use  of  all 
available  wire  density  even  if  the  wire  density  is  non-uniform.  This  is  often  required  as  the  wire 
density  and  speed  may  change  significantly  between  levels  of  packaging. 

Express cubes achieve their performance at the cost of adding interchanges, increasing the latency
for some short-distance messages, and increasing the bisection width of the network. Each interchange
adds a component to the system. A local message that crosses an interchange but does not
take the express channel incurs one additional node delay, (Tn + Tw). Express channels
increase  the  wire  bisection  by  using  available  unused  wiring  capacity.  In  parts  of  the  network  that 
are  already  wire-limited  the  express  and  local  channels  can  be  combined  as  shown  in  Figure  7. 
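The short-distance penalty can be estimated with a small calculation. The sketch below is a hypothetical one-dimensional model: it assumes an interchange sits immediately before every node whose address is a multiple of i, that the message stays on local channels for its whole path, and that each interchange crossed costs one extra node delay (Tn + Tw). The function name and the delay values in the usage lines are illustrative only.

```python
def interchange_penalty(src: int, dst: int, i: int,
                        t_n: float, t_w: float) -> float:
    """Extra latency for a message that stays on local channels.

    Counts the block boundaries (multiples of i) between src and dst
    and charges one node delay (t_n + t_w) for each interchange crossed.
    """
    lo, hi = sorted((src, dst))
    crossings = hi // i - lo // i
    return crossings * (t_n + t_w)


print(interchange_penalty(1, 3, i=4, t_n=1.0, t_w=0.5))   # same block
print(interchange_penalty(2, 5, i=4, t_n=1.0, t_w=0.5))   # one boundary
```

Messages that stay within one block of i nodes pay nothing; only those that straddle a block boundary without using the express channel see the added delay.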

As the performance of interconnection networks approaches the limits of the underlying wiring
media, their range of application increases. These networks can go beyond exchanging messages
between the nodes of concurrent computers to serving as a general interconnection medium for
digital electronic systems. For distances larger than D' = αi log_i α, the delay of a hierarchical
express  cube  network  is  within  a  factor  of  three  of  that  of  a  dedicated  wire.  The  network  may 
provide  better  performance  than  the  wire  because  it  is  able  to  share  its  wiring  resources  among 
many  paths  in  the  network  while  a  dedicated  wire  serves  only  a  single  source  and  destination. 
For  distances  smaller  than  D',  dedicated  wiring  offers  a  significant  latency  advantage  at  the  cost 
of  eliminating  resource  sharing. 